-
Notifications
You must be signed in to change notification settings - Fork 32
🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling 🚨🚨🚨 #8423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling 🚨🚨🚨 #8423
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #8423 +/- ##
==========================================
+ Coverage 87.01% 87.44% +0.43%
==========================================
Files 2011 1604 -407
Lines 78602 66901 -11701
Branches 1348 761 -587
==========================================
- Hits 68392 58499 -9893
+ Misses 9807 8149 -1658
+ Partials 403 253 -150
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
1eb7d20 to
9b3ec9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the computational backend to properly compute and track the number of threads for dask-workers in the autoscaling system. The key changes involve adding support for generic resources (particularly threads) to the Resources model and extending the Dask monitoring configuration.
- Add generic resources support to the Resources model with proper arithmetic operations
- Introduce DASK_NTHREADS and DASK_NTHREADS_MULTIPLIER configuration settings
- Update resource comparison and computation logic to handle the new generic resources
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| services/clusters-keeper/src/simcore_service_clusters_keeper/data/docker-compose.yml | Adds DASK_NTHREADS environment variables to the compose file |
| services/autoscaling/src/simcore_service_autoscaling/core/settings.py | Introduces new Dask thread configuration settings |
| packages/aws-library/src/aws_library/ec2/_models.py | Extends Resources model with generic_resources field and related operations |
| services/autoscaling/src/simcore_service_autoscaling/modules/dask.py | Adds function to compute instance thread resources |
| services/autoscaling/src/simcore_service_autoscaling/utils/cluster_scaling.py | Updates resource comparison logic |
| services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py | Refactors resource mapping logic and updates tests |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
ec5ef47 to
f68a40d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py
Outdated
Show resolved
Hide resolved
...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py
Outdated
Show resolved
Hide resolved
8b7db41 to
83bf3b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
services/autoscaling/src/simcore_service_autoscaling/core/errors.py
Outdated
Show resolved
Hide resolved
2c11b7e to
0dcad8e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
services/autoscaling/src/simcore_service_autoscaling/modules/dask.py:1
- Unconditionally injecting a thread resource of 1 per task in both processing and unrunnable lists duplicates logic and may misrepresent tasks that already define a thread-related generic resource. Centralizing this augmentation (e.g. via a helper that only adds the key if absent) reduces duplication and prevents accidental overwrites.
import collections
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py
Show resolved
Hide resolved
.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py
Show resolved
Hide resolved
services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
packages/dask-task-models-library/src/dask_task_models_library/resource_constraints.py
Outdated
Show resolved
Hide resolved
services/autoscaling/src/simcore_service_autoscaling/modules/dask.py
Outdated
Show resolved
Hide resolved
...ces/autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_dynamic.py
Outdated
Show resolved
Hide resolved
...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py
Outdated
Show resolved
Hide resolved
services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py
Outdated
Show resolved
Hide resolved
32d9139 to
cbe908b
Compare
…/resource_constraints.py Co-authored-by: Copilot <[email protected]>
…ask.py Co-authored-by: Copilot <[email protected]>
…luster_scaling/_utils_computational.py Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
7f30e5d to
f2bcb52
Compare
|
…dask-worker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)" This reverts commit ffe52c1.
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423) Co-authored-by: Copilot <[email protected]>
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423) Co-authored-by: Copilot <[email protected]>
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423) Co-authored-by: Copilot <[email protected]>
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423) Co-authored-by: Copilot <[email protected]>



What do these changes do?
The
autoscalingservice when running in computational mode tries to understand via the dask client what are the needs for EC2 instances.autoscalingchecks it fits,autoscalingfinds the best suitable EC2 instance based on the resources,For non-billable systems:
Until this PR, the
autoscalingservice would estimate the available resources from an EC2 using what the AWS EC2 API returns, which is not exactly whatDockeror evenDaskthen sees once the EC2 instance is up and running.This would sometimes create dead locks where a machine that should in theory handle the task would actually not since the docker engine and/or dask worker "sees" a bit less memory or cpus. This PR shall correct this fact by using the same computations everywhere.
**computational provider (a.k.a. dask)
Every dask-worker has a defined number of
threads, a.k.a. the theoretical number of jobs that can be completed in parallel.The dask-worker in our implementation either takes what CPUs it finds, or overrides it with
DASK_NTHREADSenvironement, or use theDASK_NTHREADS_MULTIPLIERenvironment.The meaning of this is that even if a user wants to run 50 tasks requiring 0.1 CPU and a machine has 20 CPUs, it cannot run more than
nthreadsin parallel.This PR allows now the
autoscalingservice to understand this concept by allowing so-called "generic resources". This will also open the door to add GPU support and any kind of resource.🚨🚨🚨 some caution on deployment to ensure everything runs as smooth as possible
Related issue/s
How to test
Dev-ops